Predicting in-hospital mortality after transcatheter aortic valve replacement using administrative data and machine learning

Transcatheter aortic valve replacement (TAVR) is the gold standard treatment for patients with symptomatic aortic stenosis. The utility of existing risk prediction tools for in-hospital mortality post-TAVR is limited due to two major factors: (a) the predictive accuracy of these tools is insufficient when only preoperative variables are incorporated, and (b) their efficacy is also compromised when solely postoperative variables are employed, subsequently constraining their application in preoperative decision support. This study examined whether statistical/machine learning models trained with solely preoperative information encoded in the administrative National Inpatient Sample database could accurately predict in-hospital outcomes (death/survival) post-TAVR. Fifteen popular binary classification methods were used to model in-hospital survival/death. These methods were evaluated using multiple classification metrics, including the area under the receiver operating characteristic curve (AUC). By analyzing 54,739 TAVRs, the top five classification models had an AUC ≥ 0.80 for two sampling scenarios: random, consistent with previous studies, and time-based, which assessed whether the models could be deployed without frequent retraining. Given the minimal practical differences in the predictive accuracies of the top five models, the L2 regularized logistic regression model is recommended as the best overall model since it is computationally efficient and easy to interpret.

www.nature.com/scientificreports/ able cardioverter defibrillator (ICD), peripheral vascular disease, cardiac arrhythmias, chronic kidney disease, anemia, pulmonary circulation disorder, atrial fibrillation and flutter, chronic pulmonary disease, liver disease, and coagulopathy; and to be admitted to an urban nonteaching hospital in the south. In addition, these patients were less likely to have a history of cancer, carotid artery disease, dyslipidemia, valvular disease, smoking, coronary artery bypass graft (CABG), coronary artery disease (CAD), myocardial infarction (MI), permanent pacemaker (PPM), percutaneous coronary intervention (PCI), and transient ischemic attack (TIA). A total of 43,791 (80%) and 10,948 (20%) patients were randomly assigned to the development and validation cohorts, respectively. Patient characteristics were similar between the development and validation cohorts. A total of 39,820 (74.5%) and 13,982 (25.5%) patients were split based on the time from 2012 to 2018 and 2019, respectively. The characteristics were similar between the two sets (see Supplementary Table S1 for the full list and question 2).
ML classifiers' performance for in-hospital mortality prediction based on preoperative information. Table 2 presents the baseline results from the dummy model along with the testing/holdout results of our top five classification models with varying (5/10/20/30/40/all) feature sets and different holdout datasets for our first (preoperative predictors and random sampling) and second (preoperative predictors and time-based sampling) research questions. Irrespective of the model, the AUC, balanced accuracy, sensitivity, precision, and F1 score improved as more features were used for prediction. Figure 1 depicts the improvement in the AUC as more features were made available to the model. The importance of the predictors varied primarily with the classification model used. Figure 2 (Supplementary Table S2) shows the ranked importance of the features for each model for research question 1. Overall, age was the most important feature for all models, except for the GBC model, which had fluid and electrolyte disorders as its top feature (the second most important feature for the other four models). The presence of liver disease, hypertension, peripheral vascular disease, dyslipidemia, cardiac arrhythmias, and smoking were among the features consistently picked as important by the models. www.nature.com/scientificreports/  www.nature.com/scientificreports/

Discussion
Herein, we proposed using publicly available administrative data to predict in-hospital mortality post-TAVR using 15 classification models. Comparing the results of our top 5 classification models with the baseline dummy classification benchmark, five observations were made. First, our classification metrics generally improved as more features were made available to the model, with modest improvements after the top 20-30 features. This indicates that the use of chi-square tests to select the top variables was suitable for our dataset since it did not contradict the results obtained when the boosting-based algorithms and ridge regression were used with all features. Second, all five models were able to provide "good" predictive results per 1 definition of an AUC ≥ 0.80 for both the random and time-based sampling scenarios (Table 2). These AUC values generally corresponded when ≥ 30 top features were available for the model. Third, it is important to note that the holdout results were consistent with our cross-validation results (see Supplementary Table S3), indicating no evidence of overfitting. Fourth, the LR model presented the best overall predictive performance for our two research questions since its AUC, balanced accuracy, sensitivity, precision, and F1 score were among the highest values observed across all models. For example, it had the highest balanced accuracy of 0.743 and 0.732 (i.e., the arithmetic mean of sensitivity and specificity). The other metrics were all significantly higher than those of the dummy classifier and practically equivalent to those of the best performing model for each metric. We recommend the LR model since it is computationally efficient and is the easiest to interpret among the five top models. Fifth, the precision and F1 scores for all the models were relatively low (approximately 10%); however, this was expected given the highly imbalanced nature of our dataset. Note that the dummy classifier returned a value of 0 with its default setting, and it would have returned a value approximately equal to our survival percentage of ~ 2% if the stratified input was selected 15 . Our feature importance in Fig. 2 is somewhat similar to the top features reported by 3 . However, four of their top five features (acute kidney injury, cardiogenic shock, cardiac arrest, and sepsis) were not available for selection by our models since they were all postsurgery complications. Only fluid and electrolyte disorders, which were their third most important feature, were available for selection by our model.
While several studies have examined the use of statistical and/or machine learning models for TAVR prediction 3,16-21 , few studies have examined in-hospital mortality 3,16,17 . In these papers, the reported AUC scores were (a) 0.66 based on 9 preoperative variables in 16 , (b) 0.92 based on a combination of pre-and postoperative predictors in 3 , and (c) not reported in 17 as logistic regression was used to compute the odds ratio for predictors of mortality in the adjusted analysis of patients who underwent TAVR with end-stage renal disease. Our original sample size of 54,739 was much larger than the 10,891, 20,540, and 6,836 patients used in the other studies. Furthermore, our AUC value of > 0.81 was significantly larger than the 0.66 reported in 16 . While it is less than the 0.92 reported in 3 , the differences are attributed to not including postoperative predictors; our top models had AUC values of 0.91-0.93 when postoperative predictors were available during model building (Supplementary  Table S4). We computed the performance of the models with postoperative data only to support our previous statement and to show that the models' predictive performance could reach the values reported in 3 (given that our dataset included low-risk patients who were not in their dataset, and the authors did not make their code available). www.nature.com/scientificreports/ This study demonstrated the feasibility of using solely the preoperative information available in administrative data to accurately predict in-hospital mortality post-TAVR. Our study represents the first report in which "good" predictive performance 1 could be achieved using solely preoperative, administrative predictors. Our results mark a substantial improvement, an increase in the AUC of approximately 0.15, over the results of 16 , while addressing the main limitation in 3 , i.e., "the inability to restrict variables pre-procedure versus post-procedure, which provides a dynamic nature to the NIS TAVR score. " Specifically, we showed that by removing any of the variables that could have occurred "post-procedure", a model could still have good predictive performance and quantify the impact of not including such variables on the predictive performance (a decrease in the AUC from 0.91-0.93 to 0.81). Our approach is consistent with the recommendation in 13 who stated that "for decision support algorithms to be implemented in clinical practice, we would expect them to be accurate and pertinent at the time a decision is taken. " Furthermore, we showed that the developed models may not need to be retrained often, since the predictive performance for the 2019 holdout dataset in question 2 was similar to that obtained from random sampling in our first research question. To our knowledge, this is the first study in which deploymentrelated questions were assessed with respect to TAVR operations. In our estimation, our time-based sampling approach presents a methodological approach to "subject [decision support models for clinical practice] to the test of time" per the recommendation of 13 .
Despite the examination of a relatively large number of machine learning models, our study showed that the predictive performance of an L2 regularized logistic regression model was equivalent to the results obtained using more complex machine learning models. This is also consistent with the results obtained in 13 using postoperative predictors. This confirmatory result can accelerate the use/deployment of logistic regression as a preoperative risk scoring tool for TAVR procedures. In our estimation, the use of a logistic-type model would be preferable in medical practice for four main reasons. First, it is an explainable model, i.e., when the coefficients are exponentiated, we can capture the change in odds when one predictor is increased by one unit, holding other predictors constant. Hence, this follows the recommendations of 22 , who recommended the use of interpretable models for high-stakes decisions, and 13 , who stated that "clinicians and patients should also be provided enough information to understand the process that led to the decision." Second, statistical tests for variable significance and model goodness-of-fit analyses can be performed to provide additional insight about the model. Third, due to its interpretability and good predictive performance, it is meaningful to not only look at the dichotomized prediction but also to extract the underlying survival/death probability from the model. We anticipate that reporting a preprocedural survival probability would inform clinical pathway determination and provide a structured, data-driven risk adjustment of expected outcomes. Fourth, regularized logistic regression can be performed using multiple software programs that are currently used in medical settings 23 .
While this study utilized only administrative, preoperative variables in model building, it is interesting to note that our reported predictive accuracy metrics were, at a minimum, similar to (if not exceeding) a large amount of the TAVR risk assessment literature in which clinical data were utilized (often for a slightly longer 30-day prediction period). In 2015, the American College of Cardiology (ACC) and the Society of Transthoracic Surgeons (STS) developed an in-hospital mortality risk score based on STS/ACC transcatheter valve therapy (TVT) registry data 16 . This risk score took into account the patient's baseline serum creatinine (sCr, mg/dL) level, dialysis status, New York Heart Association (NYHA) classification, urgency of the procedure, presence of severe lung disease, and type of access (femoral vs. nonfemoral) with a 30-day mortality AUC of 0.66. Since then, other TAVR-specific risk models have been developed to predict 30-day mortality, such as FRANCE-2 (AUC = 0.67) 18 , OBSERVANT (AUC = 0.71) 19 , and CoreValve U.S. (AUC = 0.75) 21 . More recently, a deep learning-based approach was used in 24 to predict cerebrovascular events (CVEs) post-TAVR using both clinical and imaging data. Their approach resulted in an AUC of 0.79, and they showed that CVEs increased the odds of death by 2.62 and were most likely to occur on the first day post-TAVR. Based on the aforementioned studies, we conclude that our findings are informative since we showed that the use of administrative, preoperative variables with a simple L2 logistic regression model was sufficient to predict in-hospital mortality (with results similar to those of state-ofthe-art studies that utilized clinical data, albeit for a 30-day prediction period).
In our estimation, there are three scenarios that can be used to deploy our model in practice. First, practitioners may want to utilize our developed models as is. To assist them in such an instance, we have developed a web app 25 where they can input the values for the predictor variables based on their patient, and we return both the predicted outcome and the associated probability for survival. Note that the model deployed in the app is trained on the entire NIS data sample based on the recommendation in 26 . Second, the model can be deployed as is using a different snapshot of the NIS database. In this case, we recommend following our five-step approach highlighted in the central illustration. Once satisfactory predictive performance is achieved, the best model should be retrained on all the data prior to model deployment 26 . To assist practitioners in such a scenario, we provide our code in 27 , which they can reuse for their dataset. Third, in large hospital system settings, there may be access to pertinent clinical predictors in addition to NIS variables. Penalized logistic regression (e.g., LASSO, ridge, elastic net, etc.) can be used to model such data. The research question in such a case would be whether clinical data would provide more information when compared to the out-of-network patients whose data would be deleted due to the absence of clinical data.
Limitations. There are several limitations in this study that need to be highlighted. First, our models were based on the HCUP NIS database. The database was not designed for clinical decision support, and the derivation of clinical information from ICD codes is a limitation since "some nonrelated clinical diagnoses may be omitted and may not represent the true prevalence of risk factors" 3 , and the encoding of such raw health data may be inconsistent across hospitals/providers/time. Second, our models' predictions were limited to in-hospital mortality. While the post-TAVR survival probability should monotonically decrease over time, the decision to www.nature.com/scientificreports/ operate on a patient is based on a longer survival time frame and clinical data that were not observed in our study. Third, innovations in TAVR procedures (e.g., an increased prevalence of robot-assisted surgeries) and pre/ postprocedural care were not captured in our analyses. Such innovations can significantly improve survival outcomes and deem the historical data used in model training obsolete. Thus, incorporating this domain expertise in training/retraining our statistical/machine learning models would be an important consideration 28 if they are deployed for decision support.

Conclusions
Despite the complexity of TAVR procedures and the variability in patient mix, post-TAVR survival and death can be somewhat predicted using only administrative, preoperative data and several standard statistical/machine learning models. Our study illustrates that administrative data can be used to predict and/or risk adjust complex medical procedures, such as TAVR, without the need for frequent retraining.

Methods
Data source. The dataset used was acquired from the NIS/HCUP database 5 . The unit of analysis was the discharge record. ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) codes 3505 and 3506 were used to identify all patients ≥ 18 years who underwent a TAVR procedure between January 01, 2012, and September 30, 2015. Furthermore, the ICD-10-CM codes 02RF4xx and 02RF3xx were used to identify all patients ≥ 18 years who underwent a TAVR procedure between October 01, 2015, and December 31, 2019. A total of 54,739 TAVR records were obtained using the aforementioned ICD codes, filtering nonadult patients and removing missing data for age, race, sex, income, elective surgery, and in-hospital mortality. The data were divided into two groups: those who survived the procedure (alive; n = 53,626) and those who died during the same hospitalization (deceased; n = 1113). For each procedure, the ICD-9-CM (prior to October 01, 2015) or ICD-10-CM (starting from October 01, 2015) was used to identify comorbidities and the TAVR approach (see Supplementary Table S5 for utilized codes).
Ethical approval. Per the HCUP site 5 , "HCUP databases conform to the definition of a limited data set.
A limited data set is healthcare data in which 16 direct identifiers, specified in the Privacy Rule, have been removed. Under HIPAA [the Health Insurance Portability and Accountability Act], review by an institutional review board (IRB) is not required for use of limited data sets. " Study design. Figure 3 shows the workflow of this study from data extraction to the use of machine learning techniques to address our two research questions: (a) the utility of NIS preoperative variables alone in predicting TAVR survival and (b) the deployment of such predictive models without frequent retraining. The workflow consisted of five major steps. First, we extracted all TAVR procedures that occurred in 2012-2019 from the NIS database using SAS software (version 9.4, SAS Institute Inc., USA). Second, Python 3.9 was used to prepare the data into a tabular dataset for machine learning, i.e., generate the predictor set that would be used to predict TAVR outcomes. The predictors were divided into patient demographics (age, sex, race, pay information and ZIP code quartile), hospital information (region, bed size, urban/rural/teaching hospital, etc.), and binary indicators of comorbidities. The dataset had 54,739 rows/procedures and 45 columns/variables. The last three steps of training, evaluating, and interpreting the machine learning models were performed separately for each question.
For our first research question, the TAVR dataset was randomly split into an 80% training dataset (n = 43,791, with 42,906 alive and 885 deceased individuals) and a 20% test dataset (n = 10,948, with 10,720 alive and 228 deceased individuals). On the other hand, for the second research question, the training set contained 40,757 procedures from 2012 to 2018 (39,820 alive and 937 deceased individuals), and the test dataset contained 13,982 procedures from 2019 (13,806 alive and 176 deceased individuals). The training-test split for the second question was 74.5-25.5%.
Given the imbalance between the living and deceased patients in both training samples, we examined the use of random undersampling, random oversampling, and combined resampling to create balanced training datasets 29 using the imbalance-learn Python library (version 0.9.1). Based on our preliminary analyses, random oversampling resulted in the best prediction performance and hence was used. The resulting training sizes for questions 1 and 2 were 85,812 and 79,640, respectively, each containing an equal number of living and deceased patients.
Similar to 3 , a feature ranking approach was used to examine the top 5/10/20/30/40/all features as inputs in our machine learning models. While the use of external variable/feature selection is not optimal for machine learning models with built-in feature selection 30 , we used this approach to be consistent with 3 since some of our examined models did not have a built-in feature selection technique (e.g., support vector machines). The external feature selection used was the "classic" method 31 , with a 0.80 threshold, from the PyCaret (version 2. 3.6) Python library 26 .
Using PyCaret, stratified fivefold cross-validation was used to train 15 popular binary classification models for question 1 and 2 training datasets with the aforementioned 5/10/20/30/40/all top features. The 15 models included: (a) traditional statistical models: logistic regression with an L2 penalty (hereafter denoted as LR for conciseness), ridge regression, linear discriminant analysis (LDA), quadratic discriminant analysis, and naïve Bayes; (b) single machine learning classifiers: support vector machines with a linear kernel, k-nearest neighbor classifiers, and decision trees; and (c) ensemble classifiers: gradient boosting classifier (GBC), light gradient boosting machine (LightGBM), CatBoost, Ada Boost classifier, extreme gradient boosting, random forests, and   www.nature.com/scientificreports/ strategy for the dummy classifier, i.e., prior, which predicted the most frequent class in our training set for all test samples without regard to features. This allowed us to understand the predictive gains obtained from using our administrative features and machine learning models when compared to a dummy classifier. Note that the baselining in a regression problem is somewhat similar since the r 2 metric captures the improvement in predictive performance compared to just using a dummy model (with the average of the response for prediction irrespective of the values of any potential features). The five classification models were trained in PyCaret for each of the sets of features and questions. The parameters of the tuned classification models are described in Supplementary Table S6. Furthermore, the dummy model was trained once for each question since it predicted the majority class (i.e., survival post-TAVR for all patients). All of the models were evaluated on the separate (i.e., step 4, not part of training) test sets for questions 1 and 2 using the following performance measures 35,36 : accuracy, AUC, balanced accuracy, sensitivity (recall), specificity, precision (i.e., positive predictive value (PPV)), negative predictive value (NPV), and F1 score. For the sake of conciseness, we did not further describe these models. We refer the reader to the scikitlearn documentation 36 for a detailed introduction to LR, LDA, and GBC. Similarly, the LightGBM and CatBoost documentation are available from their respective frameworks 37,38 .
In the fifth step of our workflow, we utilized PyCaret to create diagnostic plots of each model's performance. Due to space limitations, we only show the feature importance plot in this paper.
Statistical analysis. Following the approach of 3 , a two-tailed t test was used to compare the differences within continuous variables, and chi-square tests were utilized for categorical data. These tests were performed using Minitab software (version 19, Minitab Inc., USA), and p < 0.05 was considered statistically significant. The performance of the classification models was assessed using the AUC; however, we also reported other metrics, including accuracy, balanced accuracy, sensitivity/recall, specificity, and precision, as is customary in the literature 29,34 . The training and evaluation of the models were performed using the PyCaret library 26 in Python.

Data availability
The NIS can be purchased from the U.S. Agency for Healthcare Research and Quality (AHRQ). Per their data usage agreement 5 , "I will not redistribute HCUP data by posting on any website or publishing in any other publicly accessible online repository. If a journal or publication requests access to data or analytic files, I will cite restrictions on data sharing in this Data Use Agreement and direct them to AHRQ HCUP (www. hcup-us. ahrq. gov) for more information on accessing HCUP data. "